*Adapted from starter code.*
HappyDB is a corpus of 100,000 happy moments crowd-sourced via Amazon's Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746
In this R notebook, we process the raw textual data for our data analysis.
From the packages' descriptions:

- "tm" is a framework for text mining applications within R;
- "tidyverse" is an opinionated collection of R packages designed for data science; all packages share an underlying design philosophy, grammar, and data structures;
- "tidytext" allows text mining using "dplyr", "ggplot2", and other tidy tools;
- "DT" provides an R interface to the JavaScript library DataTables.

## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 3
## minor 4.3
## year 2017
## month 11
## day 30
## svn rev 73796
## language R
## version.string R version 3.4.3 (2017-11-30)
## nickname Kite-Eating Tree
library(tidyverse)  # provides readr::read_csv

urlfile <- 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)
We clean the text by converting all letters to lower case, and by removing punctuation, numbers, empty words, and extra white space.
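The cleaning steps can be sketched in base R as follows (the notebook itself applies the equivalent "tm" transformations to a corpus; the "moments" vector here is a hypothetical example):

```r
# Two hypothetical raw happy moments
moments <- c("I ate 2 GREAT tacos!!  ", "My dog   learned a trick.")

cleaned <- tolower(moments)                  # convert all letters to lower case
cleaned <- gsub("[[:punct:]]", "", cleaned)  # remove punctuation
cleaned <- gsub("[[:digit:]]", "", cleaned)  # remove numbers
cleaned <- gsub("\\s+", " ", cleaned)        # collapse extra white space
cleaned <- trimws(cleaned)                   # trim leading/trailing spaces
```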
Stemming reduces each word to its stem. We stem the words here and then convert the "tm" object to a "tidy" object for much faster processing.
We also need a dictionary to look up the words corresponding to the stems.
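As a small sketch of such a dictionary (the "words" vector is a hypothetical example): "tm"'s stemDocument uses the Porter stemmer, which is also exposed as SnowballC::wordStem, so we can pair each original word with its stem in a data frame.

```r
library(SnowballC)  # Porter stemmer, the same one tm uses under the hood

# Hypothetical vocabulary drawn from the corpus
words <- c("running", "moments", "happiness")

# Dictionary mapping each stem back to an original word
dict <- data.frame(stem = wordStem(words), word = words)
```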
We remove stopwords provided by the "tidytext" package and also add custom stopwords relevant in the context of our data.
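The idea can be sketched in base R (the "tidytext" package ships a full stop_words data frame; the short standard list and the custom list below are hypothetical examples):

```r
# A few standard English stopwords (tidytext's stop_words is much larger)
standard_stopwords <- c("the", "a", "and", "i", "was")

# Custom stopwords: words that are frequent but uninformative in HappyDB,
# since nearly every entry mentions them (hypothetical choices)
custom_stopwords <- c("happy", "day", "moment")

all_stop <- c(standard_stopwords, custom_stopwords)

tokens <- c("i", "was", "happy", "the", "day", "we", "won")
kept <- tokens[!tokens %in% all_stop]
```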
Here we combine the stems and the dictionary into the same “tidy” object.
Lastly, we complete the stems by picking the corresponding word with the highest frequency.
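The completion step amounts to: for each stem, count how often each original word maps to it, and keep the most frequent one. A base-R sketch, using a hypothetical combined table of (stem, word) pairs:

```r
# Hypothetical combined object: each row pairs a stem with the word it came from
pairs <- data.frame(
  stem = c("happi", "happi", "happi", "run"),
  word = c("happy", "happy", "happiness", "running")
)

# For one stem's words, return the single most frequent word
pick_most_frequent <- function(words) names(which.max(table(words)))

# Complete every stem with its highest-frequency word
completed <- tapply(pairs$word, pairs$stem, pick_most_frequent)
```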
We want our processed words to resemble the structure of the original happy moments, so we paste the words back together to form the processed happy moments.
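The reassembly step can be sketched as follows (the "words_by_moment" list is a hypothetical example of the per-moment token lists produced by the steps above):

```r
# Each element holds the processed tokens of one happy moment
words_by_moment <- list(
  c("ate", "great", "tacos"),
  c("dog", "learned", "trick")
)

# Paste each moment's words back into a single string
processed_moments <- sapply(words_by_moment, paste, collapse = " ")
```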